(54) Apparatus for selectably encrypting or decrypting data



(57) Apparatus (40) for selectably encrypting or decrypting data, said apparatus (40) being arranged to
receive a control signal for selecting between encryption and decryption. The apparatus (40) comprises
programmable Look-up Tables (LUTs) (60, 160, 160'). The apparatus further comprises at least one storage
device (92, 94) for storing a first set and a second set of LUT values, the apparatus being arranged to
program some or all of said LUTs with said first set of LUT values when said control signal is set to
encrypt, and to program some or all of said LUTs with said second set of LUT values when said control
signal is set to decrypt. In the preferred embodiment, the apparatus is arranged to implement the
Advanced Encryption Standard, or Rijndael, cipher.

Description

FIELD OF THE INVENTION 

[0001] The present invention relates to the field of data encryption. The invention relates particularly to an
apparatus for data encryption and data decryption according to private key, or symmetric key, data encryption
algorithms.

BACKGROUND TO THE INVENTION 

[0002] Secure or private communication, particularly over a telephone network or a computer network, is dependent
on the encryption, or enciphering, of the data to be transmitted. One type of data encryption, commonly known as
private key encryption or symmetric key encryption, involves the use of a key, normally in the form of a
pseudo-random number, or code, to encrypt data in accordance with a selected data encryption algorithm (DEA). To
decipher the encrypted data, a receiver must know and use the same key in conjunction with the inverse of the
selected encryption algorithm. Thus, anyone who receives or intercepts an encrypted message cannot decipher it
without knowing the key.
[0003] Data encryption is used in a wide range of applications including IPSec Protocols, ATM Cell Encryption,
Secure Socket Layer (SSL) protocol and Access Systems for Terrestrial Broadcast. In many applications, the
encryption and/or decryption is performed in real-time and so it is desirable to perform the encryption and/or
decryption as quickly as possible. It is also desirable for a data encryption apparatus to be able to perform both
encryption and decryption.

[0004] In September 1997 the National Institute of Standards and Technology (NIST) issued a request for candidates
for a new Advanced Encryption Standard (AES) to replace the existing Data Encryption Standard (DES). A data
encryption algorithm commonly known as the Rijndael Block Cipher was selected for the new AES.
[0005] There is a need therefore for a data encryption apparatus that performs data encryption and data decryption,
preferably in accordance with the Rijndael algorithm, and at a rate that is suitable for commercial applications,
particularly real-time applications.

SUMMARY OF THE INVENTION

[0006] Accordingly, one aspect of the present invention provides an apparatus for selectably encrypting or
decrypting data, the apparatus being arranged to receive a control signal for selecting between encryption and
decryption, the apparatus comprising at least one data processing module arranged to perform one or more data
encryption or data decryption operations depending on the setting of said control signal, wherein at least part of
said data processing module comprises one or more programmable Look-up Tables (LUTs), the apparatus further
comprising at least one storage device for storing a first set and a second set of LUT values, the apparatus being
arranged to program some or all of said LUTs with said first set of LUT values when said control signal is set to
encrypt, and to program some or all of said LUTs with said second set of LUT values when said control signal is
set to decrypt.
[0007] In the preferred embodiment, the apparatus comprises a plurality of LUTs all of which are programmed with
said first set of LUT values during encryption and programmed with said second set of LUT values during decryption.
In an alternative embodiment, the apparatus comprises a plurality of LUTs and the first and second sets of LUT
values each comprise a plurality of respective sub-sets of LUT values. During encryption, some of the LUTs are
programmed with a respective one of the sub-sets of LUT values belonging to said first set. During decryption, all
of the LUTs are programmed with a respective one of the sub-sets of LUT values belonging to said second set.

[0008] Preferably, the apparatus comprises a plurality of instances of a data processing module arranged in a data
processing pipeline.

[0009] Preferably, the apparatus is arranged to perform encryption or decryption in accordance with the Rijndael
Block Cipher, wherein the data processing module is arranged to implement a Rijndael round. More preferably, the
data processing module is arranged to implement the ByteSub transformation of the Rijndael round in at least one
LUT. Preferably, said first set of LUT values is arranged to program a LUT to implement the Rijndael ByteSub
transformation and said second set of LUT values is arranged to program a LUT to implement the inverse of the
Rijndael ByteSub transformation. Preferably, the data processing module includes a respective LUT for each byte of
an input data block.

[0010] In the alternative embodiment, the entire Rijndael round is implemented using LUTs.
[0011] Preferably, the first and second set of LUT values are stored in respective first and second storage devices
and the apparatus further includes a 2-to-1 selector switch operable by said control signal to select said first
storage device when the control signal is set to encode, and to select said second storage device when the control
signal is set to decode. More preferably, the apparatus includes two or more sets of a first and a second storage
device, each first and second storage device storing said first and second set of LUT values respectively, the
apparatus further including a respective 2-to-1 selector switch for each set of first and second storage devices.
Preferably, said first and second storage devices are implemented by means of respective Read Only Memories (ROMs).

[0012] Alternatively, said first and second sets of LUT values are stored in respective storage locations of a
single storage device and are selectable by a 2-to-1 selector switch.
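
The selection arrangement described above (two stored sets of LUT values, one of which is written into a programmable LUT according to the control signal) may be modelled in software roughly as follows. This is only a minimal sketch: the class name `SelectableLut`, its methods, and the toy table contents are assumptions made for illustration, and the toy values are not the Rijndael tables.

```python
class SelectableLut:
    """Programmable 8-bit LUT loaded from one of two stored value sets."""

    def __init__(self, first_set, second_set):
        # Models the two storage devices (e.g. ROMs) holding the value sets.
        self.storage = {True: list(first_set), False: list(second_set)}
        self.table = [0] * 256  # programmable LUT contents (e.g. a RAM)

    def program(self, encrypt: bool) -> None:
        # Models the 2-to-1 selector: the control signal picks the set written.
        self.table = list(self.storage[encrypt])

    def lookup(self, byte: int) -> int:
        return self.table[byte]


# Toy value sets (NOT the Rijndael tables): a byte permutation and its inverse.
first_set = [(5 * x + 1) % 256 for x in range(256)]
second_set = [0] * 256
for x, y in enumerate(first_set):
    second_set[y] = x

lut = SelectableLut(first_set, second_set)
lut.program(encrypt=True)
cipher_byte = lut.lookup(0x3C)
lut.program(encrypt=False)
plain_byte = lut.lookup(cipher_byte)
```

Reprogramming the same table, rather than holding both tables in the datapath, is what lets one LUT serve both directions.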

[0013] Preferably, each LUT is implemented by means of a programmable Random Access Memory (RAM) or a
programmable Read Only Memory (ROM).

[0014] Preferably, the apparatus is implemented on a Field Programmable Gate Array (FPGA).

[0015] The invention further provides a computer program product comprising computer useable instructions
arranged to generate, in whole or in part, an apparatus according to the invention. The apparatus may therefore be
implemented as a set of suitable computer programs. Typically, the computer program takes the form of a hardware
description, or definition, language (HDL) which, when synthesised using conventional hardware synthesis tools,
generates semiconductor chip data, such as mask definitions or other chip design information, for generating a
semiconductor chip. The invention also provides said computer program stored on a computer useable medium. The
invention further provides semiconductor chip data, stored on a computer usable medium, arranged to generate, in
whole or in part, an apparatus according to the invention.

[0016] In the following description of preferred embodiments of the invention, a fully pipelined data encryption
and decryption apparatus is presented in the context of implementing the Rijndael algorithm. A skilled person will
appreciate that at least some of the aspects of the present invention may equally be employed in the implementation
of other private key, or symmetric key, encryption/decryption algorithms in which at least some of the data
transformations differ between encryption and decryption. The Serpent Algorithm is an example of such an algorithm.

[0017] The apparatus, or cores, are conveniently implemented using Foundation Series 2.1i software on the
Virtex-E (Trade Mark) FPGA (Field Programmable Gate Array) family of devices as produced by Xilinx of San Jose,
California, USA (www.xilinx.com). A fully pipelined Rijndael data encryption/decryption apparatus requires
considerable memory, hence its implementation is ideally suited to the Virtex-E range of FPGAs, which contain
devices with up to 280 RAM Blocks (BRAMs). In the preferred embodiment, the apparatus is implemented on a Virtex
XCV3200E-8-CG1156 FPGA device. There are no known single-chip FPGA implementations of the Rijndael algorithm which
perform both encryption and decryption.

[0018] Other aspects of the invention will be apparent to those ordinarily skilled in the art upon review of the following 
description of specific embodiments and with reference to the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] Embodiments of the invention are now described by way of example and with reference to the accompanying
drawings in which:

Figure 1a is a representation of data bytes arranged in a State rectangular array;

Figure 1b is a representation of a cipher key arranged in a rectangular array;

Figure 1c is a representation of an expanded key schedule;

Figure 2 is a schematic illustration of the Rijndael Block Cipher;

Figure 3 is a schematic illustration of a normal Rijndael Round;

Figure 4 is a schematic representation of a preferred embodiment of an apparatus according to the invention;

Figure 5 is a schematic representation of a data processing module included in the apparatus of Figure 4;

Figure 5a is a schematic representation of a MixCol transformation module included in the data processing module
of Figure 5;

Figure 6 is a representation of a data block in State form;

Figure 7 is a table of LUT values for use during encryption;

Figure 8 shows computer program code for implementing a multiplier block;

Figure 9 shows a flow chart for implementing the Rijndael key schedule for a 128-bit cipher key;






EP 1 246 389 A1 



Figure 9a shows a flow chart for implementing the Rijndael key schedule for a 192-bit cipher key;

Figure 9b shows a flow chart for implementing the Rijndael key schedule for a 256-bit cipher key;

Figure 10 is a table of LUT values for use during data decryption;

Figure 11 is a schematic representation of an arrangement for initialising LUTs according to the invention;

Figure 12a is a schematic representation of the normal Rijndael Round for use during encryption by an alternative
embodiment of the invention;

Figure 12b is a schematic representation of the normal Rijndael Round for use during decryption by an alternative
embodiment of the invention;

Figure 13 is a schematic representation of a Rijndael Round module for implementing data encryption in the
alternative embodiment of the invention;

Figure 14 is a table of LUT values for use in the module of Figure 13;

Figure 15 is a second table of LUT values for use in the module of Figure 13; and

Figure 16 is a schematic representation of a Rijndael Round module for implementing data decryption in the
alternative embodiment of the invention.

DETAILED DESCRIPTION OF THE DRAWINGS

1. The Rijndael Algorithm

[0020] The Rijndael algorithm is a private key, or symmetric key, DEA and is an iterated block cipher. The Rijndael
algorithm (hereinafter "Rijndael") is defined in the publication "The Rijndael Block Cipher: AES proposal" by
J. Daemen and V. Rijmen presented at the First AES Candidate Conference (AES1) of August 20-22, 1998, the contents
of which publication are hereby incorporated herein by way of reference.

[0021] In accordance with many private key DEAs, including Rijndael, encryption is performed in multiple stages,
commonly known as iterations, or rounds. Such DEAs lend themselves to implementation using a data processing
pipeline, or pipelined architecture. In a pipelined architecture, a respective data processing module is provided
for each round, the data processing modules being arranged in series. A message to be encrypted is typically split
up into data blocks that are fed in series into the pipeline of data processing modules. Each data block passes
through each processing module in turn, the processing modules each performing an encryption operation (or a
decryption operation) on each data block. Thus, at any given moment, a plurality of data blocks may be
simultaneously processed by a respective processing module; this enables the message to be encrypted (and
decrypted) at relatively fast rates.
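
The behaviour of such a pipeline may be illustrated with a small software model. The sketch below is an assumption-laden illustration, not the apparatus itself: the function name `run_pipeline` and the arithmetic stage functions are hypothetical stand-ins for the round modules, and one loop iteration stands for one clock cycle.

```python
def run_pipeline(stages, blocks):
    """Clock `blocks` through `stages`, with one register in front of each stage.

    Returns the outputs in order and the number of clock ticks taken.
    """
    regs = [None] * len(stages)      # register before each stage
    pending = list(blocks)
    outputs = []
    ticks = 0
    while pending or any(r is not None for r in regs):
        # The last stage drives the pipeline output.
        if regs[-1] is not None:
            outputs.append(stages[-1](regs[-1]))
        # Every other stage writes into the register of the following stage.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i - 1](prev) if prev is not None else None
        # A new data block enters the first register, if one is waiting.
        regs[0] = pending.pop(0) if pending else None
        ticks += 1
    return outputs, ticks


# Three toy "rounds"; real rounds would transform 128-bit States.
toy_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, ticks = run_pipeline(toy_stages, [1, 2])
```

In this model the tick count equals the number of stages plus the number of blocks, so once the pipeline is full one block emerges per clock cycle.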

[0022] Each processing module uses a respective sub-key, or round key, to perform its encryption operation. The
round keys are derived from a primary key, or cipher key.

[0023] With Rijndael, the data block length and cipher key length can be 128, 192 or 256 bits. The NIST requested
that the AES must implement a symmetric block cipher with a block size of 128 bits, hence the variations of Rijndael
which can operate on larger block sizes do not form part of the standard itself. Rijndael also has a variable
number of rounds, namely 10, 12 and 14 when the cipher key lengths are 128, 192 and 256 bits respectively. The
following description of the invention relates primarily to an embodiment in which the apparatus is arranged to use
a 128-bit cipher key although a skilled person will appreciate that alternative embodiments may readily be created
to implement other key lengths, including 192 or 256 bits.
[0024] With reference to Figure 1a, the transformations performed during the Rijndael encryption operations
consider a data block as a 4-column rectangular array, or State (generally indicated at 10 in Figure 1a), of 4-byte
vectors 12. For example, a 128-bit plaintext (i.e. unencrypted) data block consists of 16 bytes, $B_0$, $B_1$,
$B_2$, $B_3$, $B_4$ ... $B_{14}$, $B_{15}$. Hence, in the State 10, $B_0$ becomes $P_{0,0}$, $B_1$ becomes
$P_{1,0}$, $B_2$ becomes $P_{2,0}$, $B_4$ becomes $P_{0,1}$ and so on.

[0025] With reference to Figure 1b, the cipher key is also considered to be a multi-column rectangular array 14 of
4-byte vectors 16, the number of columns, $N_k$, depending on the cipher key length. In Figure 1b, the vectors 16
headed by bytes $K_{0,4}$ and $K_{0,5}$ are present when the cipher key length is 192 bits or 256 bits, while the
vectors 16 headed by bytes $K_{0,6}$ and $K_{0,7}$ are only present when the cipher key length is 256 bits.

[0026] Referring now to Figure 2, there is shown, generally indicated at 20, a schematic representation of Rijndael.
The algorithm design consists of an initial data/key addition operation 22, in which a plaintext data block is added
to the cipher key, followed by nine, eleven or thirteen rounds 24 when the key length is 128 bits, 192 bits or
256 bits respectively and a final round 26, which is a variation of the typical round 24. There is also a key
schedule operation 28 for expanding the cipher key in order to produce a respective different round key for each
round 24, 26.

1.1 The Rijndael Round 

[0027] Figure 3 illustrates the typical Rijndael round 24. The round 24 comprises a ByteSub transformation 30, a
ShiftRow transformation 32, a MixColumn transformation 34 and a Round Key Addition 36. The ByteSub transformation
30, which is also known as the s-box of the Rijndael algorithm, operates on each byte in the State 10 independently.
[0028] In Rijndael, finite field mathematics is used when manipulating data. For example, a byte b, with bits
$b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0$, is considered as a polynomial with coefficients in the finite field {0, 1}. The
polynomial may be represented as:

$$b_7x^7 + b_6x^6 + b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$$

[0029] This polynomial representation of the byte allows mathematical operations such as addition, multiplication
and multiplicative inverse to be performed relatively simply. For example, the addition of two bytes is achieved by
summing, modulo 2, the respective polynomial coefficients. In binary notation, this corresponds to a simple bitwise
XOR operation. Finite field mathematics is well known and for further information reference is made to the
publication "Introduction to Finite Fields and their Applications" by R. Lidl and H. Niederreiter, Cambridge
University Press, Revised Edition, 1994.
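
By way of illustration, byte addition and multiplication in this polynomial representation can be sketched as below. The function names `gf_add` and `gf_mul` are hypothetical, and the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (hexadecimal '11B') is taken from the Rijndael specification rather than from the text above.

```python
def gf_add(a: int, b: int) -> int:
    """Add two bytes as polynomials over {0, 1}: coefficient-wise sum modulo 2."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:            # current coefficient of b is 1: add shifted a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0x100:        # degree 8 reached: subtract the field polynomial
            a ^= 0x11B
    return result


sum_example = gf_add(0x57, 0x83)
product_example = gf_mul(0x57, 0x83)
```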

[0030] The s-box 30 involves finding the multiplicative inverse of each byte in the finite, or Galois, field
$GF(2^8)$. An affine transformation is then applied, which involves multiplying the result of the multiplicative
inverse by a matrix M (as defined in the Rijndael specification) and adding the hexadecimal number '63' (as is
stipulated in the Rijndael specification).
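
The two steps of the s-box can be sketched in software to generate the 256 table values. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the field polynomial '11B' and the rotate-and-XOR form of the matrix M are taken from the Rijndael specification, and the inverse of '00' is taken to be '00' as that specification stipulates.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(v: int, n: int) -> int:
    """Rotate a byte left by n bit positions."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    """Multiply by the matrix M (written as rotations) and add '63'."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


SBOX = [affine(gf_inv(x)) for x in range(256)]
```

A table built this way is what would be loaded into a LUT so that neither the inversion nor the affine step has to be computed at run time.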

[0031] In the ShiftRow transformation 32, the rows of the State 10 are cyclically shifted to the left. Row 0 is not
shifted, row 1 is shifted 1 place, row 2 by 2 places and row 3 by 3 places.
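
As a brief illustration, the row shifts can be written as follows. The function name `shift_row` and the representation of the State as a list of four rows are assumptions made for this sketch.

```python
def shift_row(state):
    """Cyclically shift row r of a 4x4 State left by r places (r = 0..3)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# State filled column by column with byte indices 0..15, listed here by row.
example_state = [
    [0, 4, 8, 12],
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
]
shifted = shift_row(example_state)
```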

[0032] The MixColumn transformation 34 operates on the columns of the State 10. Each column, or 4-byte vector 12,
is considered a polynomial over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $c(x)$, where

$$c(x) = \text{'03'}x^3 + \text{'01'}x^2 + \text{'01'}x + \text{'02'} \qquad (1)$$

(the inverted commas surrounding the polynomial coefficients signifying that the coefficients are given in
hexadecimal).
[0033] Finally, in Round Key Addition 36, the State 10 bytes and the round key bytes are added by a bitwise XOR
operation.
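
The polynomial multiplication modulo $x^4 + 1$ can be sketched directly: because $x^4 = 1$ under that modulus, the product coefficient of $x^k$ collects every term $c_i a_j$ with $i + j \equiv k \pmod 4$. The names below are hypothetical, and the field polynomial '11B' is assumed from the Rijndael specification.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


# Coefficients of c(x) from equation (1), constant term first: c_0 .. c_3.
C_ENC = [0x02, 0x01, 0x01, 0x03]


def mix_column(column, c=C_ENC):
    """Multiply a 4-byte column (a_0 first) by c(x) modulo x^4 + 1."""
    out = []
    for k in range(4):
        acc = 0
        for j in range(4):
            acc ^= gf_mul(c[(k - j) % 4], column[j])
        out.append(acc)
    return out


def add_round_key(state_bytes, key_bytes):
    """Round Key Addition: bytewise XOR of State and round key."""
    return [s ^ k for s, k in zip(state_bytes, key_bytes)]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```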

[0034] In the final round 26, the MixColumn transformation 34 is omitted.

1.2 Key Schedule

[0035] The Rijndael key schedule 28 consists of two parts: Key Expansion and Round Key Selection. Key Expansion
involves expanding the cipher key into an expanded key, namely a linear array 15 (Fig. 1c) of 4-byte vectors or
words 17, the length of the array 15 being determined by the data block length, $N_b$ (in 4-byte words), multiplied
by the number of rounds, $N_r$, plus 1, i.e. array length = $N_b \times (N_r + 1)$. For the 128-bit data block
considered here, the data block length is four words, $N_b = 4$. When the key block length $N_k$ = 4, 6 and 8, the
number of rounds is 10, 12 and 14 respectively. Hence the lengths of the expanded key are as shown in Table 1 below.



Table 1. Length of Expanded Key for Varying Key Sizes

Data Block Length, $N_b$      4     4     4
Key Block Length, $N_k$       4     6     8
Number of Rounds, $N_r$      10    12    14
Expanded Key Length          44    52    60
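
The entries of Table 1 follow from the array length formula $N_b \times (N_r + 1)$. A short check, using a hypothetical function name and the round counts stated in the text:

```python
# Number of rounds for each key block length N_k, as stated in the text.
ROUNDS_FOR_NK = {4: 10, 6: 12, 8: 14}


def expanded_key_length(nk: int, nb: int = 4) -> int:
    """Length of the expanded key in 4-byte words: N_b * (N_r + 1)."""
    nr = ROUNDS_FOR_NK[nk]
    return nb * (nr + 1)


table_1 = {nk: expanded_key_length(nk) for nk in (4, 6, 8)}
```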






[0036] The first $N_k$ words of the expanded key comprise the cipher key. When $N_k$ = 4 or 6, each subsequent
word, W[i], is found by XORing the previous word, W[i-1], with the word $N_k$ positions earlier, W[i-$N_k$]. For
words 17 in positions which are a multiple of $N_k$, a transformation is applied to W[i-1] before it is XORed. This
transformation involves a cyclic shift of the bytes in the word 17. Each byte is passed through the Rijndael s-box
30 and the resulting word is XORed with a round constant stipulated by Rijndael (see Rcon[i] function described
below). However, when $N_k$ = 8, an additional transformation is applied: for words 17 in positions of the form
(a multiple of $N_k$) + 4, each byte of the word, W[i-1], is passed through the Rijndael s-box 30.
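
The key expansion just described may be sketched in software as below. This is an illustrative model only, not the flow charts of Figures 9 to 9b: all function names are hypothetical, the s-box is generated from the field arithmetic (field polynomial '11B' assumed from the Rijndael specification), and the round constant is assumed to start at '01' and double in $GF(2^8)$ at each use.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def _sbox_entry(x: int) -> int:
    inv = 1
    for _ in range(254):               # x^254 is the inverse of non-zero x
        inv = gf_mul(inv, x)
    inv = inv if x else 0
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key: bytes, nb: int = 4):
    """Return the expanded key as a list of 4-byte words (lists of ints)."""
    nk = len(key) // 4
    nr = {4: 10, 6: 12, 8: 14}[nk]
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01
    for i in range(nk, nb * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]           # cyclic shift of the bytes
            temp = [SBOX[b] for b in temp]       # s-box on each byte
            temp[0] ^= rcon                      # round constant
            rcon = gf_mul(rcon, 0x02)
        elif nk == 8 and i % nk == 4:
            temp = [SBOX[b] for b in temp]       # extra step for 256-bit keys
        words.append([a ^ b for a, b in zip(words[i - nk], temp)])
    return words


example_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
expanded = expand_key(example_key)
```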

[0037] The round keys are selected from the expanded key 15. In a design with $N_r$ rounds, $N_r + 1$ round keys
are required. For example, a 10-round design requires 11 round keys. Round key 0 comprises words W[0] to W[3] of
the expanded key 15 and is utilised in the initial data/key addition 22, round key 1 comprises W[4] to W[7] and is
used in round 0, round key 2 comprises W[8] to W[11] and is used in round 1 and so on. Finally, round key 10 is
used in the final round 26.
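
Round key selection amounts to slicing the expanded key into consecutive groups of four words. A minimal sketch, with a hypothetical function name and placeholder integers standing in for the words W[i]:

```python
def select_round_keys(expanded_key, nb: int = 4):
    """Split the expanded key into round keys of N_b words each."""
    return [expanded_key[i:i + nb] for i in range(0, len(expanded_key), nb)]


# Placeholder expanded key for a 10-round design: word i is represented by i.
placeholder_words = list(range(44))
round_keys = select_round_keys(placeholder_words)
```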

1.3 Decryption

[0038] The decryption process in Rijndael is effectively the inverse of its encryption process. Decryption comprises
an inverse of the final round 26, inverses of the rounds 24, followed by the initial data/key addition 22. The
data/key addition 22 remains the same as it involves an XOR operation, which is its own inverse. The inverse of the
round 24, 26 is found by inverting each of the transformations in the round 24, 26. The inverse of ByteSub 30 is
obtained by applying the inverse of the affine transformation and taking the multiplicative inverse in $GF(2^8)$ of
the result. In the inverse of the ShiftRow transformation 32, row 0 is not shifted, row 1 is now shifted 3 places,
row 2 by 2 places and row 3 by 1 place. The polynomial, $c(x)$, used to transform the State 10 columns in the
inverse of MixColumn 34 is given by

$$c(x) = \text{'0B'}x^3 + \text{'0D'}x^2 + \text{'09'}x + \text{'0E'} \qquad (2)$$
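
The inverse of ByteSub can likewise be tabulated. The sketch below follows the order given above (inverse affine transformation first, then multiplicative inverse). The helper names are hypothetical, and the rotate-and-XOR form of the inverse affine step with constant '05', like the field polynomial '11B', is taken from the Rijndael specification rather than from this text.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(v: int, n: int) -> int:
    """Rotate a byte left by n bit positions."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def inverse_affine(s: int) -> int:
    """Undo the affine step of the s-box."""
    return rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05


INV_SBOX = [gf_inv(inverse_affine(s)) for s in range(256)]
```

Such a table would form the second set of LUT values, loaded in place of the encryption table when decryption is selected.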

[0039] Similarly to the data/key addition 22, Round Key addition 36 is its own inverse. During decryption, the key
schedule 28 does not change, however the round keys constructed for encryption are now used in reverse order. For
example, in a 10-round design, round key 0 is still utilized in the initial data/key addition 22 and round key 10
in the final round 26. However, round key 1 is now used in round 8, round key 2 in round 7 and so on.
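
The round key ordering for the typical rounds can be expressed as a simple index mapping. The function below is a hypothetical helper written for illustration; it numbers the typical rounds from 0 as in the text and leaves the initial addition (round key 0) and the final round (round key $N_r$) aside.

```python
def round_key_index(round_no: int, nr: int, encrypt: bool) -> int:
    """Index of the round key used by typical round `round_no` (0 .. nr - 2)."""
    if not 0 <= round_no <= nr - 2:
        raise ValueError("round_no must identify a typical round")
    if encrypt:
        return round_no + 1          # round 0 uses key 1, round 1 uses key 2, ...
    return (nr - 1) - round_no       # reverse order during decryption


decrypt_order = [round_key_index(r, 10, encrypt=False) for r in range(9)]
```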

2. Implementation of the Rijndael algorithm 

[0040] A number of different architectures can be considered when designing an apparatus or circuit for
implementing encryption algorithms. These include Iterative Looping (IL), where only one data processing module is
used to implement all of the rounds. Hence for an n-round algorithm, n iterations of that round are carried out to
perform an encryption, data being passed through the single instance of data processing module n times. Loop
Unrolling (LU) involves the unrolling of multiple rounds. Pipelining (P) is achieved by replicating the round, i.e.
devising one data processing module for implementing the round and using multiple instances of the data processing
module to implement successive rounds. In such an architecture, data registers are placed between each data
processing module to control the flow of data. A pipelined architecture generally provides the highest throughput.
Sub-Pipelining (SP) is carried out on a partially pipelined design when the round is complex. It decreases the
pipeline's delay between stages but increases the number of clock cycles required to perform an encryption. A fully
pipelined architecture is preferred for the apparatus of the invention as this provides the highest throughput. It
will be understood however that the invention may alternatively be applied to a sub-pipelined or iterative loop
architecture.

[0041] An embodiment of a data encryption and decryption apparatus according to the invention is now described.
Figure 4 shows an apparatus, or core, generally indicated at 40, for selectably encrypting or decrypting data
according to the invention.

[0042] The apparatus 40 comprises a fully pipelined architecture including a pipeline of data processing modules 44
(hereinafter 'round modules 44') each arranged to implement the typical Rijndael round 24 and a data processing
module 46 (hereinafter 'round module 46') arranged to implement the Rijndael final round 26. Storage elements in
the form of data registers 42 are provided before each round module 44, 46. For illustrative purposes, the
apparatus 40 is shown as implementing ten rounds and so corresponds to the case where both the input plaintext
block length and the cipher key length are 128 bits. It will be understood from the foregoing description that the
number of rounds depends on the cipher key length.

[0043] The apparatus 40 also includes a data/key addition module 48 arranged to implement the data/key addition
operation 22 and a key schedule module 50 arranged to implement the key schedule 28 operations.

[0044] The implementation of the modules 44, 46, 48 and 50 is now described in more detail. 

[0045] The Data/Key Addition module 48 comprises an XOR component (not shown) arranged to perform a bitwise
XOR operation of each byte $B_i$ of the State 10 comprising the input plaintext, with a respective byte $K_i$ of
the cipher key.

[0046] Referring now to Figure 5, there is shown a preferred implementation of the round module 44. The round
module 44 includes a ByteSub module 52 arranged to implement the ByteSub transformation 30, a ShiftRow module
54 arranged to implement the ShiftRow transformation 32, a MixCol module 56 arranged to implement the MixColumn
transformation 34 and a Key addition module 58 arranged to implement the Round Key Addition operation 36.
[0047] A major consideration in the design of the apparatus 40 is the memory requirement. The ByteSub module 52
is therefore advantageously implemented as one or more look-up tables (LUTs) or ROMs. This is a faster and more
cost-effective (in terms of resources required) implementation than implementing the multiplicative inverse
operation and affine transformation in logic. Figure 6 shows, as the round input, an example State 10 in which the
sixteen data bytes are labeled $B_0$ to $B_{15}$. Since the State bytes $B_0$ to $B_{15}$ are operated on
individually, each ByteSub module 52 requires sixteen 8-bit to 8-bit LUTs. The Xilinx Virtex-E (Trade Mark) range
of FPGAs is preferred for implementation as it contains FPGA devices with up to 280 BlockSelectRAM (BRAM) (Trade
Mark) storage devices, or memories. Conveniently, a single BRAM can be configured into two single port 256 x 8-bit
RAMs (a description of how to use the Xilinx BRAM is given in the Xilinx Application Note XAPP130: Virtex Series;
using the Virtex Block SelectRAM+ Features; URL: http://www.xilinx.com; March 2000). Hence, when using a Virtex
FPGA, eight BRAMs are used in each ByteSub module 52 to implement the 16 LUTs, since each of the two RAMs in each
respective BRAM can serve as an 8-bit to 8-bit LUT (when the write enable input of the RAM is low ('0'),
transitions on the write clock input are ignored and data stored in the RAM is not affected. Hence, if the RAM is
initialized and both the input data and write enable pins are held low, then the RAM can be utilized as a ROM or
LUT). Figure 7 shows a table giving the hexadecimal values required in an LUT for implementing the ByteSub
transformation 30 during Rijndael encryption. The values given in Figure 7 are set out in ascending order in rows
reading from left to right. Thus, row 0 of the table gives the LUT outputs for input values from '00' to '07'
(hexadecimal), row 1 gives the LUT output values for input values from '08' to '0F' and so on until row 31 gives
the LUT output values for inputs 'F8' to 'FF'. For example, an input of '00' (hexadecimal) to the LUT returns the
output '63' (hexadecimal), an input of '8A' (hexadecimal) to the LUT returns the output '7E' (hexadecimal) (row 17)
and 'FF' gives the output '16'.
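
The memory arithmetic and the table layout described above can be checked with a few lines. The constant and function names are hypothetical, and the total for ten round modules is simply the per-module figure multiplied by ten, ignoring any BRAMs needed elsewhere (for example in the key schedule).

```python
LUTS_PER_BYTESUB = 16      # one 8-bit to 8-bit LUT per State byte
RAMS_PER_BRAM = 2          # a BRAM configured as two 256 x 8-bit RAMs
VALUES_PER_TABLE_ROW = 8   # layout of the table of LUT values


def brams_per_bytesub() -> int:
    """BRAMs needed by one ByteSub module."""
    return LUTS_PER_BYTESUB // RAMS_PER_BRAM


def table_row(input_byte: int) -> int:
    """Row of the LUT value table that holds the output for `input_byte`."""
    return input_byte // VALUES_PER_TABLE_ROW


brams_for_ten_rounds = 10 * brams_per_bytesub()
```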

[0048] In Figure 5, the BRAMs are enumerated as 60. Each State byte $B_0$ to $B_{15}$ is provided as the input to a
respective one of the 16 single port RAMs (not shown) provided by the 8 BRAMs 60. Thus, each BRAM 60 in the ByteSub
module 52 operates on two State bytes at a time. The respective resulting outputs of the BRAMs 60 are then provided
as the input to the ShiftRow module 54, again in State format as shown in Figure 6.

[0049] In the ShiftRow module 54, the required cyclical shifting on the rows of the State 10 is conveniently
performed by appropriate hardwiring arrangements as shown in Figure 5. Row 1 and Row 3 of the State 10 are operated
on differently during encryption and decryption. In the respective data lines 62, 64 for Row 1 and Row 3, the
ShiftRow module 54 therefore includes selectable alternative hardwiring arrangements 66, 68 for Row 1 and 70, 72
for Row 3. The alternative hardwiring arrangements 66, 68 and 70, 72 are selectable via a respective switch, or
2-to-1 multiplexer 74, 76, depending on the setting of a control signal Enc/Dec. The control signal Enc/Dec is
generated externally of the apparatus 40 and determines whether the apparatus 40 performs data encryption or data
decryption. During encryption, hardwiring arrangement 66 is selected for data line 62 while hardwiring arrangement
70 is selected for data line 64. During decryption, hardwiring arrangement 68 is selected for data line 62 while
hardwiring arrangement 72 is selected for data line 64. The resulting State 10 output from the ShiftRow module 54
is provided to the MixCol module 56, which is shown in Figure 5a.
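
The selectable hardwiring can be modelled as two fixed index permutations, one of which is chosen by the control signal. The sketch below is an illustration under assumptions: the State is held as a flat list of 16 bytes filled column by column (index = row + 4 x column), and the names `ENC_WIRING`, `DEC_WIRING` and `shift_row` are hypothetical.

```python
def _wiring(direction: int):
    """Source index for each output byte; direction +1 encrypts, -1 decrypts."""
    return [row + 4 * ((col + direction * row) % 4)
            for col in range(4) for row in range(4)]


ENC_WIRING = _wiring(+1)   # row r shifted left by r places
DEC_WIRING = _wiring(-1)   # row r shifted right by r places


def shift_row(state, encrypt: bool):
    """Apply the wiring selected by the control signal (a 2-to-1 choice)."""
    wiring = ENC_WIRING if encrypt else DEC_WIRING
    return [state[src] for src in wiring]


state_in = list(range(16))
state_enc = shift_row(state_in, encrypt=True)
state_back = shift_row(state_enc, encrypt=False)

# Rows whose wiring is the same in both directions need no multiplexer.
rows_without_mux = [r for r in range(4)
                    if all(ENC_WIRING[r + 4 * c] == DEC_WIRING[r + 4 * c]
                           for c in range(4))]
```

In this model only Rows 1 and 3 differ between the two wirings, which is consistent with multiplexers being provided for those two rows alone.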
[0050] The MixCol module 56 transforms each column (Col 0 to Col 3) of the State 10. Each column is considered
a polynomial over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $c(x)$ as set out in equation
[1] for encryption and equation [2] for decryption. This can be considered as a matrix multiplication as follows:

During encryption:

$$\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} =
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad [3]$$

During decryption:

$$\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} =
\begin{bmatrix} 0E & 0B & 0D & 09 \\ 09 & 0E & 0B & 0D \\ 0D & 09 & 0E & 0B \\ 0B & 0D & 09 & 0E \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad [4]$$

[0051] Where the input to the MixCol module 56 may be denoted in State format as follows:

          Col 0      Col 1      Col 2      Col 3
Row 0     $a_0$      $a_4$      $a_8$      $a_{12}$
Row 1     $a_1$      $a_5$      $a_9$      $a_{13}$
Row 2     $a_2$      $a_6$      $a_{10}$   $a_{14}$
Row 3     $a_3$      $a_7$      $a_{11}$   $a_{15}$

[0052] And the output of the MixCol module 56 may be denoted in State format as:

          Col 0      Col 1      Col 2      Col 3
Row 0     $b_0$      $b_4$      $b_8$      $b_{12}$
Row 1     $b_1$      $b_5$      $b_9$      $b_{13}$
Row 2     $b_2$      $b_6$      $b_{10}$   $b_{14}$
Row 3     $b_3$      $b_7$      $b_{11}$   $b_{15}$

[0053] Equations [3] and [4] illustrate the matrix multiplication for the first column [$a_0$-$a_3$] of the input
State to produce the first column [$b_0$-$b_3$] of the output State. The MixCol module 56 performs the same
multiplication for the remaining columns of the input State to produce corresponding output State columns. The
values given in the multiplication matrices in [3] and [4] correspond respectively with the coefficients of the
fixed polynomial $c(x)$ given in equations [1] and [2]. These values are specific to the Rijndael algorithm.

[0054] The matrix multiplication required for the MixCol transformation can be implemented using sixteen GF(2^8) 8-bit multiplier blocks 78 (Figure 5a) arranged in four columns of four. The MixCol module 56 operates on one column of the input State at a time. Each multiplier block 78 in each column operates on the same input State byte. Thus for the first input State column [a0-a3], each of the multipliers 78 in the first column operates on a0, the multipliers 78 in the second column operate on a1 and so on. In general, the first column of multipliers 78 operates on input State byte a(4i), the second column of multipliers operates on input State byte a(4i+1), the third column on input State byte a(4i+2) and the fourth column on input State byte a(4i+3), where i = 0 to 3 and corresponds to columns 1 to 4 of the input State. Each multiplier block 78 is also provided with a second input for receiving one of two possible multiplication coefficients whose respective values are determined by the multiplication matrices in [3] and [4]. For each multiplier block 78, the respective coefficients are selectable by means of a respective switch, or 2-to-1 multiplexer 86, that is operable by the control signal Enc/Dec. The output State is produced a column at a time [b(4i), b(4i+1), b(4i+2), b(4i+3)], for i = 0 to 3, where the first output State byte in each column is obtained by combining the outputs of the first multiplier block 78 in each multiplier block column using a respective XOR gate 80.
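By way of illustration only, the arrangement of multiplier blocks, multiplexers and XOR gates described above may be modelled in software. The following Python sketch is not taken from the specification: the names gf_mul, ENC_COEFFS, DEC_COEFFS and mix_column are hypothetical, the boolean argument stands in for the control signal, and the coefficient values are the standard Rijndael MixColumn and inverse MixColumn matrices, which are assumed here to be the matrices of equations [3] and [4].

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


# Row r holds the four coefficients that contribute to output byte r of a column.
ENC_COEFFS = [[0x02, 0x03, 0x01, 0x01],
              [0x01, 0x02, 0x03, 0x01],
              [0x01, 0x01, 0x02, 0x03],
              [0x03, 0x01, 0x01, 0x02]]
DEC_COEFFS = [[0x0E, 0x0B, 0x0D, 0x09],
              [0x09, 0x0E, 0x0B, 0x0D],
              [0x0D, 0x09, 0x0E, 0x0B],
              [0x0B, 0x0D, 0x09, 0x0E]]


def mix_column(column, encrypt):
    """Transform one 4-byte State column; `encrypt` models the control signal."""
    coeffs = ENC_COEFFS if encrypt else DEC_COEFFS   # role of the 2-to-1 multiplexers
    output = []
    for row in coeffs:
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gf_mul(coefficient, byte)         # multiplier block, then XOR gate
        output.append(acc)
    return output


column_in = [0xDB, 0x13, 0x53, 0x45]
column_out = mix_column(column_in, encrypt=True)
```

In this sketch the column (DB, 13, 53, 45) maps to (8E, 4D, A1, BC) with the encryption coefficients selected, and applying the decryption coefficients to that result recovers the original column.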

[0055] Figure 8 provides suitable VHDL (Very high speed integrated circuit Hardware Description Language) code for generating the multiplier blocks 78, in which the inputs A and B given in the code correspond respectively with the first and second inputs of the multiplier blocks, and C is the product of A and B.
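For readers who wish to check the behaviour of such a multiplier block without a VHDL simulator, the following Python sketch mirrors the shift-and-reduce structure of the Figure 8 code. It is an illustrative software model only; the function name multiplier_block is hypothetical and the comments relate each step to the signals S, T and Z of Figure 8.

```python
def multiplier_block(a, b):
    """Software model of the Figure 8 multiplier: returns a * b in GF(2^8)."""
    s = a        # S(0) = '0' & A, held here as a 9-bit integer
    c = 0        # running XOR of the Z(i) terms
    for i in range(8):
        # T(i): if bit 8 of S(i) is set, reduce by the polynomial 0x11B
        if s & 0x100:
            t = (s ^ 0x11B) & 0xFF
        else:
            t = s & 0xFF
        # Z(i) equals T(i) when bit i of B is set, and zero otherwise
        if (b >> i) & 1:
            c ^= t
        # S(i+1) = T(i) & '0', i.e. T(i) shifted left by one bit
        s = t << 1
    return c
```

For example, multiplier_block(0x57, 0x83) returns 0xC1, which agrees with the multiplication example given in the published AES standard.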

[0056] The MixCol module 56 produces an output in State 10 form that is provided as an input to the key addition module 58. The key addition module 58 is provided with the respective round key as a second input. The round key is equal in length to the data block length Nb and thus comprises 16 bytes Ki, where i = 0 to 15. The key addition module 58 comprises an XOR component 90 arranged to perform a bitwise XOR operation of each byte Bi of the input State 10 with a respective byte Ki of the round key.

[0057] The result is the Round Output, in State 10 form, which is provided to the next stage in the pipeline as appropriate.

[0058] The round module 46 for the final round is the same as the round module 44 except that the MixCol module 56 is omitted.

[0059] The apparatus 40 also includes a key schedule module 50 arranged to implement the key schedule 28. In Figure 9, there is shown a flow chart illustrating the key expansion part (operations 905 to 945) and the round key selection part (operations 955 to 970) included in the key schedule 28. The flow chart of Figure 9 relates to the case where the key block length Nk = 4, the data block length Nb = 4 and the number of rounds Nr = 10. Alternative flow charts are given in Figures 9a and 9b for the case where the key lengths are 192 bits and 256 bits respectively.
[0060] Referring now to Figure 9 (numerals in parentheses () referring to the drawing labels), the input to the key schedule module 50 is the cipher key, which is assigned to the first four words W[0] to W[3] of the expanded key (905). A counter i (which represents the position of a word within the expanded key) is set to four (910). The word W[i-1] (which initially is W[3]) is assigned to a 4-byte word Temp (915). A remainder function rem is performed on the counter i to determine if its current value is a multiple of Nk, which in the present example is equal to 4 (920). If the result of the rem function is not zero, i.e. if the counter value is not exactly divisible by 4, then the word W[i-4] is XORed with the word currently assigned to Temp to produce the next word W[i] (950). For example, when i = 5, W[5] is produced by XORing W[1] with W[4].

[0061] The value of counter i is then tested to check if all the words of the expanded key have been produced; 44 words are required in the present example (945). If i is less than 44, i.e. the expanded key is not complete, then counter i is incremented (946) and control returns to step 915.

[0062] If the result of the rem function is zero (920), this indicates that the word currently assigned to Temp is in a position that is a multiple of Nk and so requires to undergo a transformation. A function RotByte is performed on the word assigned to Temp, the result being assigned to a 4-byte word R (925). The RotByte function involves a cyclical shift to the left of the bytes in a 4-byte word. For example, an input of (B0, B1, B2, B3) will produce the output (B1, B2, B3, B0).

[0063] A function SubByte is then performed on R (930), the result being assigned to a 4-byte word S. SubByte operates on a 4-byte word and involves subjecting each byte to the ByteSub transformation 30 described above.
[0064] The resulting word S is XORed with the result of a function Rcon[x], where x = i/4, the result being assigned to a 4-byte word T (935). Rcon[x] returns a 4-byte vector, Rcon[x] = (RC[x], '00', '00', '00'), where the values of RC[x] are as follows:

RC[1] = '01'
RC[2] = '02'
RC[3] = '04'
RC[4] = '08'
RC[5] = '10'
RC[6] = '20'
RC[7] = '40'
RC[8] = '80'
RC[9] = '1B'
RC[10] = '36'

[0065] The word W[i-4] is then XORed with the word currently assigned to T to produce the next word W[i] (940).
[0066] The value of counter i is then tested to check if all the words of the expanded key have been produced (945). If i is not less than 43 then the expanded key is complete.
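The key expansion steps described above can be sketched in software as follows. This Python model is illustrative only and assumes the 128-bit case (Nk = 4, 44 words). The names gf_mul, build_sbox and expand_key are hypothetical. The ByteSub table is derived here from the standard Rijndael definition (field inverse followed by an affine map) rather than copied from Figure 7, which is an assumption made so that the sketch is self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_sbox():
    """Derive the ByteSub table: field inverse followed by the affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()
RC = [None, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def expand_key(cipher_key):
    """Expand a 16-byte cipher key into 44 four-byte words W[0] to W[43]."""
    w = [list(cipher_key[4 * n:4 * n + 4]) for n in range(4)]   # (905)
    for i in range(4, 44):                                      # counter i
        temp = list(w[i - 1])                                   # (915)
        if i % 4 == 0:                                          # rem test (920)
            r = temp[1:] + temp[:1]                             # RotByte (925)
            s = [SBOX[byte] for byte in r]                      # SubByte (930)
            temp = [s[0] ^ RC[i // 4], s[1], s[2], s[3]]        # XOR with Rcon (935)
        w.append([x ^ y for x, y in zip(w[i - 4], temp)])       # (940) or (950)
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

With the example cipher key used in the published AES standard, this sketch yields W[4] = a0fafe17 and W[43] = b6630ca6.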

[0067] To perform round key selection, a second counter j (which represents a round key index) is set to zero (960). Four 4-byte words W[4j] to W[4j+3] are assigned to Round Key(j) (965) for j = 0 to 10 (965, 970). Thus, for a ten round encryption/decryption, eleven round keys are provided, round key 0 to round key 10, where round key 0 comprises words W[0] to W[3] of the expanded key (i.e. the original cipher key), round key 1 comprises words W[4] to W[7] of the expanded key, and so on (see Fig. 1c). Round key 0 is used by the data/key addition module 48, round key 1 is provided to the round module 44 for round 1, round key 2 is provided to the round module 44 for round 2 and so on until round key 10 is used in the round module 46 for the final round (see Figs 4 and 5).
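The round key selection just described amounts to slicing the expanded key into groups of four words. A minimal Python sketch follows; the function name select_round_keys is hypothetical and the expanded key used in the example is placeholder data, not a real key.

```python
def select_round_keys(expanded_key, rounds=10):
    """Group the expanded key words into round keys 0 to `rounds`."""
    round_keys = []
    for j in range(rounds + 1):                     # j = 0 to 10 (965, 970)
        words = expanded_key[4 * j:4 * j + 4]       # W[4j] to W[4j+3]
        round_keys.append([byte for word in words for byte in word])
    return round_keys


# Placeholder expanded key: word n is simply four copies of the value n.
expanded = [[n] * 4 for n in range(44)]
round_keys = select_round_keys(expanded)
```

Each entry of round_keys is a 16-byte list, so round_keys[0] holds W[0] to W[3] and round_keys[10] holds W[40] to W[43].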

[0068] The round keys are created as required; hence, round key 0 is available immediately, round key 1 is created one clock cycle later and so on.

[0069] The flow chart of Figure 9 may readily be coded using a conventional hardware description language (HDL), such as VHDL, and the code used to generate corresponding circuitry using conventional hardware synthesis tools.
[0070] In the key schedule module 50, LUTs can also be used to implement logic functions. In particular, some words are subjected to the ByteSub transformation 30 during key expansion (see operation 930 in Figure 9) and this is preferably implemented using one or more LUTs (not shown). The content of the LUTs during encryption is the same as shown in Figure 7. For example, in an apparatus 40 utilizing a 128-bit key, forty words are created during expansion of the key and every fourth word is passed through the Rijndael s-box (i.e. subjected to the ByteSub transformation 30) with each byte in the word being transformed, making a total of forty bytes requiring transformation. In the preferred embodiment, therefore, forty 8-bit to 8-bit LUTs (not shown) are included in the key schedule module 50. When using Xilinx Virtex BRAMs to implement these, 20 BRAMs are required. Thus, to implement the round modules 44, 46 and the key schedule 50, a total of 100 BRAMs are required: 80 BRAMs are required for the 10 rounds and a further 20 for the key schedule module 50. Similarly, 112 BRAMs are required for a 192-bit version of the apparatus (96 for the 12 rounds and 16 for the key schedule) and 138 for a 256-bit version (112 for the 14 rounds and 26 for the key schedule).
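The BRAM totals quoted above can be reproduced with a short calculation. The Python sketch below is illustrative only; the function names are hypothetical. It assumes two LUTs per BRAM and eight BRAMs per round, as stated in the description, and it assumes the additional SubByte step that the Rijndael specification applies for key lengths above 192 bits (when the word position modulo Nk equals 4), which is not spelled out in this section.

```python
def key_schedule_luts(nk, rounds):
    """Count the 8-bit LUTs needed for SubByte during key expansion."""
    total_words = 4 * (rounds + 1)
    luts = 0
    for i in range(nk, total_words):
        needs_subbyte = (i % nk == 0) or (nk > 6 and i % nk == 4)
        if needs_subbyte:
            luts += 4                     # one LUT per byte of the word
    return luts


def bram_budget(nk, rounds, luts_per_bram=2, brams_per_round=8):
    """Return (round BRAMs, key schedule BRAMs, total BRAMs)."""
    round_brams = rounds * brams_per_round
    schedule_brams = key_schedule_luts(nk, rounds) // luts_per_bram
    return round_brams, schedule_brams, round_brams + schedule_brams
```

Calling bram_budget(4, 10), bram_budget(6, 12) and bram_budget(8, 14) gives (80, 20, 100), (96, 16, 112) and (112, 26, 138) respectively, matching the figures in the text.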

[0071] In the decryption operation, the inverse of the ByteSub transformation 30 is also advantageously implemented as a LUT or ROM. However, the LUT values for decryption are different to those required for encryption. Figure 10 shows the hexadecimal values contained in a LUT during decryption for implementing the inverse of the ByteSub transformation 30. The layout of the table shown in Figure 10 is the same as described for Figure 7. For example, an input of '00' (hexadecimal) would return the output '52', while an input of 'FF' returns the output '7D'.
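The relationship between the encryption and decryption LUT contents can be illustrated in software. The Python sketch below is not part of the specification: it derives the ByteSub table from the standard Rijndael definition (an assumption, since Figures 7 and 10 are not reproduced here) and then obtains the decryption table by inverting that mapping. The names gf_mul, build_sbox, SBOX and INV_SBOX are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_sbox():
    """Derive the ByteSub table: field inverse followed by the affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()              # values used when the LUTs are set for encryption

# The decryption table is the inverse permutation of the encryption table.
INV_SBOX = [0] * 256
for value_in, value_out in enumerate(SBOX):
    INV_SBOX[value_out] = value_in
```

In this sketch INV_SBOX[0x00] is 0x52 and INV_SBOX[0xFF] is 0x7D, in agreement with the two examples given for Figure 10.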

[0072] There are a number of ways to arrange for the apparatus 40 to perform both encryption and decryption. One method involves doubling the number of BRAMs, or other LUTs/ROMs, utilised (one set of BRAMs/LUTs being used for encryption and another set being used for decryption). However, this approach is costly in area.
[0073] The preferred approach is illustrated in Figure 11. Figure 11 shows two representative ByteSub modules 52 (the ones for round 0 and for the final round respectively) as described with reference to Figure 5. Each ByteSub module 52 comprises a plurality of LUTs, or ROMs, which in the present example are provided by eight BRAMs 60, each BRAM providing two 8-bit to 8-bit LUTs in the form of its respective two single port RAMs. Two further storage devices, in the form of ROMs 92, 94, are provided to store the respective LUT values required for encryption and decryption (as shown in Figures 7 and 10 respectively). Conveniently, ROMs 92, 94 can be implemented using one or more BRAMs (assuming implementation in a Virtex FPGA), configured to serve as ROMs, one containing the initialisation values for the LUTs required during encryption, the other containing the values for the LUTs required during decryption. The ROMs 92, 94 are selectable via a 2-to-1 selector switch, or 2-to-1 multiplexer 96, that is operable by the control signal Enc/Dec. Referring back to Figure 4, the ROMs 92, 94 and the multiplexer 96 are included in a RAM initialiser module 47, the output from the RAM initialiser module 47 (which output corresponds with the output of the multiplexer 96) being provided to each of the round modules 44, 46 in order to initialise the BRAMs in the respective ByteSub modules 52 (as shown in Figure 10) with the appropriate LUT values. Thus, when the apparatus 40 is required to perform data encryption (and the control signal Enc/Dec is set accordingly), all the BRAMs 60 in the ByteSub modules 52 are initialised with data read from the ROM 92 containing the values required for encryption. When the apparatus 40 is required to perform data decryption (and the control signal Enc/Dec is set accordingly), all the BRAMs 60 in the ByteSub modules 52 are initialised with data read from the ROM 94 containing the values required for decryption.
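The selectable initialisation scheme described above can be modelled behaviourally. The following Python sketch is a toy model under stated assumptions, not the Figure 11 circuit: the class name RamInitialiser and its method are hypothetical, one ROM address is assumed to be read per clock cycle, and the ROM contents are placeholder values rather than the Figure 7 and Figure 10 tables.

```python
class RamInitialiser:
    """Toy model of RAM initialiser 47: two ROMs behind a 2-to-1 multiplexer."""

    def __init__(self, encrypt_rom, decrypt_rom):
        self.encrypt_rom = list(encrypt_rom)    # stands in for ROM 92
        self.decrypt_rom = list(decrypt_rom)    # stands in for ROM 94

    def initialise(self, luts, encrypt):
        """Program every LUT from the selected ROM; return clock cycles used."""
        # The boolean plays the role of the control signal driving multiplexer 96.
        source = self.encrypt_rom if encrypt else self.decrypt_rom
        cycles = 0
        for address, value in enumerate(source):   # one ROM address per cycle
            for lut in luts:                       # all LUTs written in parallel
                lut[address] = value
            cycles += 1
        return cycles


# Placeholder contents only; a real design would hold the Figure 7 and 10 values.
rom_92 = list(range(256))
rom_94 = list(reversed(range(256)))
luts = [[0] * 256 for _ in range(16)]    # sixteen 8-bit LUTs, i.e. eight BRAMs
initialiser = RamInitialiser(rom_92, rom_94)
cycles_used = initialiser.initialise(luts, encrypt=True)
```

Switching the same LUTs to decryption is then a second call, initialiser.initialise(luts, encrypt=False), which overwrites their contents with the values from the other ROM.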

[0074] The initialisation of the BRAMs 60 for either decryption or encryption takes 256 clock cycles as the 256 LUT values are read from ROM 92 or ROM 94 respectively. For a typical system clock of 25.3 MHz, this corresponds to an initialisation time delay of only 10 µs. When encrypting data, the keys are produced as each round requires them.
[0075] Therefore, data encryption takes 10 clock cycles, corresponding to the 10 rounds when using a 128-bit key. Data decryption takes 20 clock cycles, 10 clock cycles for the required round keys to be constructed and a further 10 cycles corresponding to the 10 rounds.
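The delay figures above follow from simple arithmetic, which the Python sketch below reproduces. The function names are hypothetical; the default values are those stated in the description for the 128-bit case.

```python
def initialisation_delay_us(lut_depth=256, clock_mhz=25.3):
    """Time in microseconds to load one LUT at one value per clock cycle."""
    return lut_depth / clock_mhz


def initial_latency_cycles(rounds=10, decrypt=False):
    """Clock cycles before the first output block appears."""
    key_setup = rounds if decrypt else 0   # decryption first builds all round keys
    return key_setup + rounds
```

With the default values, initialisation_delay_us() evaluates to approximately 10.1, and initial_latency_cycles gives 10 cycles for encryption and 20 cycles for decryption.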

[0076] It will be appreciated that the initialisation ROMs 92, 94 may be implemented using a single BRAM since a BRAM can be configured to serve as two 256 x 8-bit RAMs, each of which may be configured to operate as a ROM. In the preferred embodiment, however, each ROM 92, 94 is implemented using a respective BRAM, with each BRAM being arranged to store the respective encryption or decryption LUT values in both RAMs provided by that BRAM. Using the BRAM resources in this way simplifies the wiring required in the FPGA since two ROMs (i.e. the appropriately configured RAMs) with the appropriate LUT values are now provided to initialise the BRAMs in the round modules 44, 46 for encryption, and a further two ROMs with the appropriate LUT values for decryption are also available. When two BRAMs are used in this way, the multiplexer 96 is supplemented by a second 2-to-1 multiplexer (not shown), each of the two multiplexers having one input connected to a respective ROM holding encryption values, the other input being connected to a respective ROM holding decryption values. Both multiplexers are operable by the control signal Enc/Dec to produce a respective output. With this arrangement, two output lines are available from the RAM initialiser 47 (only one shown in Fig. 4) for initialising the BRAMs in the round modules 44, 46 and this simplifies the wiring in the FPGA. It will be appreciated that, equally, further BRAMs, or ROMs, may be used in a similar manner to further simplify the wiring if desired.

[0077] During decryption, the values of the LUTs utilised in the key schedule module 50 are the same as those required for encryption. Hence, the LUTs in the key schedule module 50 can conveniently be implemented as ROMs (where BRAMs are used, they can be configured to act as ROMs as described above). However, the round keys for decryption are used in reverse order to that used in encryption. Therefore, for the 128-bit key encryptor/decryptor apparatus 40, if data decryption is carried out initially, it is necessary to wait 20 clock cycles before the respective decrypted data appears (10 clock cycles for the construction of the 10 round keys and 10 clock cycles corresponding to the number of rounds in the apparatus 40). If data is being encrypted, or previously encrypted data is being decrypted, this initial delay is only 10 clock cycles as the round keys do not necessarily need to be reconstructed. Overall, therefore, the apparatus 40 uses 102 BRAMs although the apparatus only requires 202 LUTs in total: 160 for the rounds, 40 for the key schedule and 2 for the initialisation ROMs.

[0078] Although the apparatus 40 is arranged to perform both encryption and decryption, a skilled person will appreciate that the apparatus 40 may be modified to perform encryption only or decryption only, if desired. For an encryption only or decryption only apparatus, the RAM initialiser 47 is not necessary, nor is the control signal Enc/Dec and associated switches. Each LUT in the round modules may be implemented as a ROM and initialised with the appropriate LUT values from Figure 7 or 10.

[0079] The apparatus 40, or the encryption only/decryption only version, is preferably implemented using Xilinx Foundation Series 2.1i software and Synplify Pro V6.0 on Xilinx Virtex-E FPGA devices. Input data blocks can be accepted every clock cycle and after an initial delay (see above) the respective encrypted/decrypted data blocks appear on consecutive clock cycles. On the Virtex-E XCV3200e-8-cg1156 device, the apparatus 40 utilizes 7576 CLB slices (23%) and 102 BRAMs (49%). Of IOBs, 385 of 804 are used. The design uses a system clock of 25.3 MHz and runs at a data-rate of 3239 Mbits/sec (405 Mbytes/sec). There are no known similar single-chip FPGA encryptor/decryptor implementations. Also, the results obtained compare very well with existing ASIC implementations, as illustrated in Table 2 below.

Table 2. Specifications of Rijndael ASIC Implementations

                                      Device      Throughput (Mbits/sec)
Ichikawa, Kasuya, Matsui [2]          CMOS        1950
Weeks, Bean, Rozylowicz, Ficke [5]    CMOS        5163
Invention                             XCV3200E    3239
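Because the pipelined apparatus accepts one 128-bit data block on every clock cycle, the quoted data-rate follows directly from the system clock. The Python sketch below shows the arithmetic; the function name throughput_mbits is hypothetical.

```python
def throughput_mbits(clock_mhz, block_bits=128):
    """Data-rate in Mbits/sec when one block is accepted per clock cycle."""
    return block_bits * clock_mhz


encryptor_decryptor_rate = throughput_mbits(25.3)
```

For the 25.3 MHz clock this evaluates to roughly 3238 Mbits/sec (about 405 Mbytes/sec), in line with the figure given for the invention in Table 2. Applying the same function to a 54.35 MHz clock gives roughly 6957 Mbits/sec, i.e. approximately 7 Gbits/sec.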



[0080] The performance results obtained for an encryption only apparatus are similar to those of an apparatus with only decryption capabilities. The main difference in the two implementations is the initial delay time as mentioned above. For example, a 128-bit key encryption only design implemented on the Virtex-E XCV812e-8bg560 device utilizes 2222 CLB slices (23%) and 100 BRAMs (35%). Of IOBs, 384 of 404 are used. The design uses a system clock of 54.35 MHz and runs at a data-rate of 7 Gbits/sec (870 Mbytes/sec). These results prove faster than similar existing FPGA implementations, as illustrated in Table 3 below.



Table 3. Specifications of 128-bit Key Rijndael Encryption FPGA Implementations

                                    Type    Device      Area (CLB Slices)    Throughput (Mbits/sec)
Gaj, Chodowiec [3]                  IL      XCV1000     2902                 331.5
Elbirt, Yap, Chetwynd, Paar [4]     SP      XCV1000     9004                 1940
Dandalis, Prasanna, Rolim [1]       IL      Virtex      5673                 353
Invention                           P       XCV812E     2222                 6031



[0081] The apparatus 40 may alternatively be implemented on the new Xilinx Virtex II family of FPGA devices.
[0082] Rijndael is set to be approved by NIST and replace DES as the Federal Information Processing Encryption Standard (FIPS) in the summer of 2001. It will replace DES in applications such as IPSec protocols, the Secure Socket Layer (SSL) protocol and in ATM cell encryption. In general, hardware implementations of encryption algorithms and their associated key schedules are physically secure, as they cannot easily be modified by an outside attacker. Also, the high speed Rijndael encryptor core and Rijndael encryptor/decryptor core presented herein should prove beneficial in applications where speed is vital, as with real-time communications such as satellite communications and electronic financial transactions.

[0083] In the foregoing description, the preferred implementation is on FPGA. It will be understood that the apparatus of the invention may alternatively be implemented on other conventional devices such as other Programmable Logic Devices (PLDs) or an ASIC (Application Specific Integrated Circuit). In an ASIC implementation, the LUTs may be implemented in conventional manner using, for example, standard RAM or ROM components.
[0084] In the preferred embodiment described above, the ByteSub module 52 is implemented using LUTs while the ShiftRow and MixCol modules 54, 56 are implemented in logic. In an alternative embodiment of the invention, the ShiftRow and MixCol transformations are also implemented as LUTs rather than using logic. This embodiment is based on the following equation (modified from the equation provided in the Rijndael specification):



where,

\[
T = \begin{bmatrix}
S[a] \bullet 02 & S[a] \bullet 03 & S[a] & S[a] \\
S[a] & S[a] \bullet 02 & S[a] \bullet 03 & S[a] \\
S[a] & S[a] & S[a] \bullet 02 & S[a] \bullet 03 \\
S[a] \bullet 03 & S[a] & S[a] & S[a] \bullet 02
\end{bmatrix}
\]

and the State is denoted,

         j = 0    j = 1    j = 2    j = 3
i = 0    B0,0     B0,1     B0,2     B0,3
i = 1    B1,0     B1,1     B1,2     B1,3
i = 2    B2,0     B2,1     B2,2     B2,3
i = 3    B3,0     B3,1     B3,2     B3,3

and the significance of a, b, c, d, e and k can be seen from Figure 12a.
[0085] Figure 13 illustrates schematically a Rijndael round module 144 comprising the components required for encryption. During encryption three different sets of LUTs are required. Each LUT (labelled 'LUT' in Figure 13) in the first set of LUTs comprises values as shown in Figure 7. Each LUT (labelled 'LUT_02' in Figure 13) in the second set comprises values as shown in Figure 14. Each LUT (labelled 'LUT_03' in Figure 13) in the third set comprises values as shown in Figure 15. The LUTs may conveniently be implemented as BRAMs in the manner described above with two LUTs implemented by each BRAM. Thus, in Figure 13, each set of LUTs comprises eight BRAMs 160. The outputs of the BRAMs 160 are combined in accordance with equation [5] using XOR gates 166. The round key addition is also performed by XOR gates 166 (see also Fig. 12a).
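A software analogue of the LUT-based encryption round may help to show how the three LUT sets and the XOR gates fit together. The Python sketch below is illustrative only and is not the circuit of Figure 13: the names are hypothetical, the LUT contents are derived from the standard Rijndael definitions rather than read from Figures 7, 14 and 15, and the State is assumed to be held as 16 bytes in column order (byte index = 4 x column + row).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_sbox():
    """Derive the ByteSub table: field inverse followed by the affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


LUT = build_sbox()                               # S[a]
LUT_02 = [gf_mul(s, 0x02) for s in LUT]          # S[a] multiplied by 02
LUT_03 = [gf_mul(s, 0x03) for s in LUT]          # S[a] multiplied by 03


def lut_round(state, round_key):
    """One encryption round built only from table look-ups and XORs."""
    out = []
    for j in range(4):
        # ShiftRow is absorbed into the addressing: row r is read from column j + r.
        a = [state[4 * ((j + r) % 4) + r] for r in range(4)]
        column = [
            LUT_02[a[0]] ^ LUT_03[a[1]] ^ LUT[a[2]] ^ LUT[a[3]],
            LUT[a[0]] ^ LUT_02[a[1]] ^ LUT_03[a[2]] ^ LUT[a[3]],
            LUT[a[0]] ^ LUT[a[1]] ^ LUT_02[a[2]] ^ LUT_03[a[3]],
            LUT_03[a[0]] ^ LUT[a[1]] ^ LUT[a[2]] ^ LUT_02[a[3]],
        ]
        key_column = round_key[4 * j:4 * j + 4]
        out.extend(c ^ k for c, k in zip(column, key_column))   # round key addition
    return out
```

Each input byte needs one look-up in each of the three tables, which is consistent with three sets of sixteen LUTs per round.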

[0086] In the round module 144, 24 Block RAMs are required to implement each Rijndael round to perform an encryption.

[0087] The design for decryption is not as straightforward as the encryption design. The decryption design is based on the following equation (this equation is not provided in the Rijndael specification):



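\[
\begin{bmatrix} e_{0,j} \\ e_{1,j} \\ e_{2,j} \\ e_{3,j} \end{bmatrix}
=
\begin{bmatrix}
InvS[\,0E \bullet b_{0,j} \oplus 0B \bullet b_{1,j} \oplus 0D \bullet b_{2,j} \oplus 09 \bullet b_{3,j}\,] \\
InvS[\,09 \bullet b_{0,j-1} \oplus 0E \bullet b_{1,j-1} \oplus 0B \bullet b_{2,j-1} \oplus 0D \bullet b_{3,j-1}\,] \\
InvS[\,0D \bullet b_{0,j-2} \oplus 09 \bullet b_{1,j-2} \oplus 0E \bullet b_{2,j-2} \oplus 0B \bullet b_{3,j-2}\,] \\
InvS[\,0B \bullet b_{0,j-3} \oplus 0D \bullet b_{1,j-3} \oplus 09 \bullet b_{2,j-3} \oplus 0E \bullet b_{3,j-3}\,]
\end{bmatrix}
\qquad [6]
\]

where the column indices are taken modulo 4 and the significance of a, b, c, d, e and k can be seen from Figure 12b.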

[0088] Figure 16 shows a schematic diagram of a round module 144' for performing decryption. Round key addition is performed firstly (see also Fig. 12b) using XOR gate 167. Five LUT sets are required (each set comprising eight LUTs in this example, although only two of each set are shown in Figure 16). Each LUT in the first set of LUTs (labelled as 'InvLUT' in Fig. 16) contains values as shown in Figure 10. The LUTs required for the four other LUT sets each comprise values given respectively by (0E • bi,j), (0B • bi,j), (0D • bi,j) and (09 • bi,j). For example, the LUTs labelled 'LUT_0E' have values corresponding to (0E • bi,j) and are constructed by multiplying every possible byte from '00' to 'FF' by '0E'. The LUTs labelled 'LUT_0B', 'LUT_0D' and 'LUT_09' are constructed in a corresponding manner. Thus, 'LUT_0E' contains the values (0E • bi,j), 'LUT_0D' contains the values (0D • bi,j), 'LUT_0B' contains the values (0B • bi,j) and 'LUT_09' contains the values (09 • bi,j). The outputs of the LUTs are combined in accordance with equation [6] above using XOR gates 166'. Forty BRAMs per round module 144' are required.
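The construction of the five decryption LUT sets, and the way their outputs are combined, can be sketched in software as follows. This Python model is illustrative only and is not the circuit of Figure 16: all names are hypothetical, the inverse ByteSub table is derived from the standard Rijndael definition rather than read from Figure 10, b is assumed to denote the State after round key addition, and the State is assumed to be held as 16 bytes in column order (byte index = 4 x column + row).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_inverse_sbox():
    """Derive the inverse ByteSub table by inverting the forward table."""
    inverse = [0] * 256
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        inverse[y ^ 0x63] = a
    return inverse


INV_LUT = build_inverse_sbox()
# Each multiplication LUT holds the product of every byte '00' to 'FF' with a constant.
LUT_0E = [gf_mul(0x0E, b) for b in range(256)]
LUT_0B = [gf_mul(0x0B, b) for b in range(256)]
LUT_0D = [gf_mul(0x0D, b) for b in range(256)]
LUT_09 = [gf_mul(0x09, b) for b in range(256)]


def lut_inverse_round(state, round_key):
    """One decryption round: key addition, multiplication LUTs, then InvLUT."""
    b = [s ^ k for s, k in zip(state, round_key)]     # round key addition first
    out = [0] * 16
    for j in range(4):
        b0, b1, b2, b3 = b[4 * j:4 * j + 4]
        column = [
            LUT_0E[b0] ^ LUT_0B[b1] ^ LUT_0D[b2] ^ LUT_09[b3],
            LUT_09[b0] ^ LUT_0E[b1] ^ LUT_0B[b2] ^ LUT_0D[b3],
            LUT_0D[b0] ^ LUT_09[b1] ^ LUT_0E[b2] ^ LUT_0B[b3],
            LUT_0B[b0] ^ LUT_0D[b1] ^ LUT_09[b2] ^ LUT_0E[b3],
        ]
        for r in range(4):
            # Inverse ShiftRow moves row r of column j to column j + r (mod 4).
            out[4 * ((j + r) % 4) + r] = INV_LUT[column[r]]
    return out
```

Each byte is looked up in four multiplication tables and the combined result in one inverse ByteSub table, which corresponds to the five LUT sets described above.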

[0089] The round modules 144 and 144' may readily be combined to form a single round module (not shown) for implementing either encryption or decryption depending on the setting of the Enc/Dec signal. The combined round module is implemented using 40 BRAMs. During encryption only 24 of these are in operation, while during decryption they are all in operation. For the final round, only 8 Block RAMs are required since the final round does not include the MixCol transformation. Furthermore, the LUTs in the combined round module may be initialised using the general selectable ROM structure shown in Figure 11 (although the round module now comprises 40 BRAMs as opposed to the eight shown in Fig. 11). Eight additional ROMs are required to initialise the combined encryption/decryption round module, 3 for encryption (i.e. containing values for LUT, LUT_02 and LUT_03) and 5 for decryption (i.e. containing values for InvLUT, LUT_0E, LUT_0B, LUT_0D and LUT_09). The key schedule requires 20 BRAMs. Hence, the entire apparatus utilises 396 BRAMs. It may be said therefore that there is a first set of LUT values for encryption and a second set of LUT values for decryption, the first set containing three sub-sets of LUT values, a respective sub-set for each of LUTs LUT, LUT_02 and LUT_03, and the second set containing five sub-sets of LUT values, a respective sub-set for each of LUTs InvLUT, LUT_0E, LUT_0B, LUT_0D and LUT_09.
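The BRAM total for the alternative embodiment, and its comparison with the preferred embodiment, can be checked with the following arithmetic. The Python function names are hypothetical; the numbers passed as defaults are those stated in the description for the 128-bit, ten-round case.

```python
def alternative_embodiment_brams(rounds=10, brams_per_round=40,
                                 final_round_brams=8, init_roms=8,
                                 key_schedule_brams=20):
    """Total BRAMs when ByteSub, ShiftRow and MixCol are all held in LUTs."""
    full_rounds = rounds - 1          # the final round omits MixCol
    return (full_rounds * brams_per_round + final_round_brams
            + init_roms + key_schedule_brams)


def preferred_embodiment_brams(rounds=10, brams_per_round=8,
                               init_roms=2, key_schedule_brams=20):
    """Total BRAMs when only ByteSub is held in LUTs."""
    return rounds * brams_per_round + init_roms + key_schedule_brams
```

These evaluate to 396 and 102 BRAMs respectively, so the alternative embodiment uses nearly four times the block memory of the preferred embodiment.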

[0090] It will be seen from the foregoing that the preferred embodiment requires fewer BRAMs than the alternative embodiment. This is particularly advantageous when implementing the apparatus in FPGA or other target device where the available resources are limited or where it is important to keep size to a minimum.

[0091] The invention is not limited to the embodiments described herein which may be modified or varied without 
departing from the scope of the invention. 

Claims

1. An apparatus (40) for selectably encrypting or decrypting data, the apparatus being arranged to receive a control signal for selecting between encryption and decryption, the apparatus (40) comprising at least one data processing module (44, 144, 144') arranged to perform one or more data encryption or data decryption operations depending on the setting of said control signal, wherein at least part of said data processing module (44, 144, 144') comprises one or more programmable Look-up Tables (LUTs) (60, 160, 160'), the apparatus (40) further comprising at least one storage device (92, 94) for storing a first set and a second set of LUT values, the apparatus (40) being arranged to program some or all of said LUTs with said first set of LUT values when said control signal is set to encrypt, and to program some or all of said LUTs with said second set of LUT values when said control signal is set to decrypt.

2. An apparatus as claimed in Claim 1, wherein the apparatus comprises a plurality of LUTs each of which is programmed with said first set of LUT values during encryption and programmed with said second set of LUT values during decryption.

3. An apparatus as claimed in Claim 1, wherein the apparatus comprises a plurality of LUTs and the first and second sets of LUT values each comprise a plurality of respective sub-sets of LUT values, and wherein, during encryption, some of the LUTs are programmed with a respective one of the sub-sets of LUT values belonging to said first set and, during decryption, all of the LUTs are programmed with a respective one of the sub-sets of LUT values belonging to said second set.

4. An apparatus as claimed in any preceding claim, wherein the apparatus comprises a plurality of instances of a data processing module arranged in a data processing pipeline.

5. An apparatus as claimed in any preceding claim, wherein the apparatus is arranged to perform encryption or decryption in accordance with the Rijndael Block Cipher, and wherein the data processing module is arranged to implement a Rijndael round.

6. An apparatus as claimed in Claim 5, wherein the data processing module is arranged to implement the ByteSub transformation of the Rijndael round in at least one LUT.

7. An apparatus as claimed in Claim 5 or Claim 6, wherein said first set of LUT values is arranged to program a LUT to implement the Rijndael ByteSub transformation and said second set of LUT values is arranged to program a LUT to implement the inverse of the Rijndael ByteSub transformation.

8. An apparatus as claimed in any preceding claim, wherein the data processing module includes a respective LUT for each byte of an input data block.

9. An apparatus as claimed in Claim 5, wherein the entire Rijndael round is implemented using one or more LUTs.

10. An apparatus as claimed in any preceding claim, wherein the first and second set of LUT values are stored in respective first and second storage devices and the apparatus further includes a 2-to-1 selector switch operable by said control signal to select said first storage device when the control signal is set to encode, and to select said second storage device when the control signal is set to decode.

11. An apparatus as claimed in Claim 10, wherein the apparatus includes two or more sets of a first and a second storage device, each first and second storage device storing said first and second set of LUT values respectively, the apparatus further including a respective 2-to-1 selector switch for each set of first and second storage device.

12. An apparatus as claimed in Claim 10 or 11, wherein said first and second storage devices are implemented by means of respective Read Only Memories (ROMs).

13. An apparatus as claimed in Claim 10 or 11, wherein said first and second sets of LUT values are stored in respective storage locations of a single storage device and are selectable by a 2-to-1 selector switch.

14. An apparatus as claimed in any preceding claim, wherein each LUT is implemented by means of a programmable Random Access Memory (RAM) or a programmable Read Only Memory (ROM).

15. A computer program product comprising computer useable instructions arranged to generate an apparatus according to Claim 1.
-- GF(2^8) 8-bit Multiplier Block

library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.std_logic_arith.ALL;
use IEEE.std_logic_unsigned.ALL;

package MultiplierTypes is
    type res8bitU is array(0 to 7) of std_logic_vector(8 downto 0);
    type res8bitS is array(0 to 8) of std_logic_vector(8 downto 0);
    type res7bitT is array(0 to 7) of std_logic_vector(7 downto 0);
end package;

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use work.MultiplierTypes.all;

entity Multiplier is
    port( A, B : in  std_logic_vector(7 downto 0);
          C    : out std_logic_vector(7 downto 0));
end Multiplier;

architecture MultiplierSynth of Multiplier is
    signal S : res8bitS; -- array(0 to 8) of std_logic_vector(8 downto 0)
    signal U : res8bitU; -- array(0 to 7) of std_logic_vector(8 downto 0)
    signal T : res7bitT; -- array(0 to 7) of std_logic_vector(7 downto 0)
    signal Z : res7bitT; -- array(0 to 7) of std_logic_vector(7 downto 0)
begin
    P1 : process(S, U, T, A, B)
    begin
        S(0) <= '0' & A;
        for i in 0 to 7 loop
            -- T(i) holds A multiplied by x^i, reduced modulo the field polynomial
            if S(i)(8) = '1' then
                U(i) <= S(i) xor ('1' & x"1b"); -- '1b' is Rijndael's irreducible polynomial of degree 8
                T(i) <= U(i)(7 downto 0);
            else
                T(i) <= S(i)(7 downto 0);
                U(i) <= "000000000";
            end if;
            S(i+1) <= T(i) & '0';
            -- Z(i) is the partial product selected by bit i of B
            if B(i) = '1' then
                Z(i) <= T(i);
            else
                Z(i) <= x"00";
            end if;
        end loop;
    end process;

    C <= Z(0) xor Z(1) xor Z(2) xor Z(3) xor Z(4) xor Z(5) xor Z(6) xor Z(7);
end MultiplierSynth;

Fig 8
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